1 Exercise 2

1.1 Answer these questions from the data.

1.1.1 How many teams in the competition?

There are 14 teams in the competition.

1.1.2 How many players?

There are a total of 370 players.

1.1.3 How many rounds in the competition?

There are a total of 7 rounds in the competition.
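Counts like these come from counting distinct values in the relevant columns. A minimal sketch in base R, assuming one row per player per match; the toy data and the column names `team`, `player_id` and `round` are hypothetical stand-ins for the real columns:

```r
# Toy stand-in for the real data: one row per player per match.
# Column names and values are hypothetical, for illustration only.
toy <- data.frame(
  team      = c("A", "A", "B", "B", "C", "C"),
  player_id = c(1, 2, 3, 4, 5, 5),
  round     = c("R1", "R1", "R1", "R1", "R1", "R2")
)

# Number of distinct values in each column of interest.
n_teams   <- length(unique(toy$team))
n_players <- length(unique(toy$player_id))
n_rounds  <- length(unique(toy$round))

c(teams = n_teams, players = n_players, rounds = n_rounds)
```

Counting distinct identifiers rather than rows matters here, because a player who appears in several matches would otherwise be counted more than once.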

1.2 The 2020 season was interrupted by COVID, so there was no winning team. Make an appropriate plot of the goals by team and suggest which team might have been likely to win if the season had played out.

As the plot shows, the Kangaroos and Fremantle scored the most goals. They are therefore the teams most likely to have won the 2020 season had it been played out.
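A plot of this kind can be built by totalling goals per team and drawing a sorted bar chart. The sketch below uses base R on invented data; the column names `team` and `goals` and all values are assumptions, not the real dataset:

```r
# Toy stand-in for the real data; team names and goal counts are made up.
toy <- data.frame(
  team  = c("A", "A", "B", "B", "C", "C"),
  goals = c(2, 1, 0, 1, 3, 2)
)

# Total goals per team, sorted so the highest-scoring team comes first.
team_goals <- aggregate(goals ~ team, data = toy, FUN = sum)
team_goals <- team_goals[order(team_goals$goals, decreasing = TRUE), ]

# Horizontal bar chart; reverse the order so the top team is drawn at the top.
barplot(rev(team_goals$goals), names.arg = rev(team_goals$team),
        horiz = TRUE, las = 1, xlab = "Total goals")
```

Sorting the bars by total makes the leading teams easy to read off without comparing bar lengths across the whole chart.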

1.3 If you were to make a pairs plot of the numeric variables, how many plots would you need to make? (DON’T MAKE THE PLOT!!!)

The dataset contains 68 variables, of which 34 are numeric. A pairs plot shows the distribution of each single variable and the relationship between each pair of variables, so the full grid would contain 34 * 34 = 1156 plots. However, the variable jumper id is duplicated three times, which reduces this to 31 * 31 = 961. The total would be 528, which comprises the diagonal plots (433) and the upper and lower triangles.
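The general arithmetic for a scatterplot matrix of p variables can be checked with a short helper. This is a sketch; `pairs_panels` is a hypothetical name introduced only for illustration:

```r
# Panel counts for a scatterplot matrix of p numeric variables.
pairs_panels <- function(p) {
  c(full_grid              = p * p,          # every cell of the p x p grid
    diagonal               = p,              # one univariate panel per variable
    unique_pairs           = choose(p, 2),   # one triangle, without the diagonal
    triangle_with_diagonal = choose(p, 2) + p)
}

pairs_panels(34)
pairs_panels(31)
pairs_panels(32)
```

Under this formula, 528 is the count of one triangle plus the diagonal for p = 32 (496 distinct pairs plus 32 diagonal panels). That is an observation about the arithmetic only, not a confirmation of how the figure quoted above was derived.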

1.4 Summarise the players, by computing the means for all of the statistics. On this data, one pair of variables has an L-shaped pattern. (See the slides from week 7 if you need a reminder what this shape is.) Use scagnostics to find the pair. Make the plot, report the scagnostic used. Write a sentence to explain the relationship between the two variables, in terms of players' skills.

The scagnostics striated and stringy were used to find the L-shaped pattern, since striated checks the straightness of the points and stringy checks the dispersion. This yielded the pair of variables hitouts and bounces.
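Once a table of scagnostics exists, candidate pairs can be shortlisted by sorting on the chosen measures. The sketch below assumes a table with one row per variable pair; the names `Var1`, `Var2` and every score are invented placeholders, not results from the data:

```r
# Hypothetical scagnostics table: one row per pair of variables.
# The variable names and scores are invented placeholders, not results.
scags <- data.frame(
  Var1     = c("x1", "x1", "x2"),
  Var2     = c("x2", "x3", "x3"),
  striated = c(0.10, 0.85, 0.30),
  stringy  = c(0.55, 0.95, 0.70)
)

# Rank pairs by striated, breaking ties with stringy, highest first.
ranked <- scags[order(scags$striated, scags$stringy, decreasing = TRUE), ]
top_pair <- ranked[1, c("Var1", "Var2")]
top_pair
```

The top-ranked pair is only a candidate; the scatterplot still has to be drawn to confirm the shape.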

1.5 Find a pair of variables that exhibit a barrier. Plot it and report the scagnostic used. Write a sentence explaining the relationship.

The data appear to have a barrier: the points do not extend beyond a certain combination of x and y values.

1.6 Writing code similar to that in lecture 7B, make an interactive plotly parallel coordinate plot of the scagnostics. You can also refer to the plotly website to work out some of the difficult parts. There are two pieces that are really important to have:

1.6.1 scale on each axis needs to be 0-1, not individual variable range

1.6.2 the text outputted when traces are selected should include the pair of variables with that set of scagnostic values.

# Shiny
library(shiny)
library(plotly)

ui <- fluidPage(
  plotlyOutput("parcoords"),
  verbatimTextOutput("data"))


server <- function(input, output, session) { 
  
  # Scagnostic columns only (assumes columns 1-2 hold the variable pair)
  aflw_num <- aflw_scags[,3:15]
  
output$parcoords <- renderPlotly({ 
  dims <- Map(function(x, y) {
      list(values = x,
           range = c(0, 1),  # same 0-1 scale on every axis
           label = y)
    
    }, aflw_num, 
    names(aflw_num), 
    USE.NAMES = FALSE)
  
    plot_ly(type = 'parcoords', 
            dimensions = dims, 
            source = "pcoords") %>% 
      layout(margin = list(r = 30)) %>%
      event_register("plotly_restyle")
})

ranges <- reactiveValues()
  observeEvent(event_data("plotly_restyle", 
                          source = "pcoords"),
  {
    d <- event_data("plotly_restyle", 
                    source = "pcoords")
    
    dimension <- as.numeric(stringr::str_extract(names(d[[1]]),"[0-9]+"))
    
    
    if (!length(dimension)) return()
    
    # plotly dimensions are 0-indexed, R is 1-indexed
    dimension_name <- names(aflw_num)[[dimension + 1]]
    
    info <- d[[1]][[1]]
    ranges[[dimension_name]] <- if (length(dim(info)) == 3) {
      lapply(seq_len(dim(info)[2]), function(i) info[,i,])
    } else {
      list(as.numeric(info))
    }
  })
  
  aflw_selected <- reactive({
    keep <- TRUE
    for (i in names(ranges)) {
      range_ <- ranges[[i]]
      keep_var <- FALSE
      for (j in seq_along(range_)) {
        rng <- range_[[j]]
        keep_var <- keep_var | dplyr::between(aflw_scags[[i]], 
                                              min(rng), max(rng))
      }
      keep <- keep & keep_var
    }
    aflw_scags[keep, ]
  })
  
  output$data <- renderPrint({
    tibble::as_tibble(aflw_selected())
  })
}


shinyApp(ui, server)
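The row selection performed inside `aflw_selected` can be checked outside Shiny. The following is a sketch of the same logic as a plain function; `filter_by_ranges` is a hypothetical helper name and the toy values are invented for illustration:

```r
# Keep rows whose value in each brushed column falls inside at least one
# of that column's selected intervals (OR within a column, AND across columns).
filter_by_ranges <- function(data, ranges) {
  keep <- rep(TRUE, nrow(data))
  for (col in names(ranges)) {
    in_any <- rep(FALSE, nrow(data))
    for (rng in ranges[[col]]) {
      in_any <- in_any | (data[[col]] >= min(rng) & data[[col]] <= max(rng))
    }
    keep <- keep & in_any
  }
  data[keep, , drop = FALSE]
}

# Invented placeholder values, for illustration only.
toy <- data.frame(
  Var1   = c("x1", "x1", "x2"),
  Var2   = c("x2", "x3", "x3"),
  convex = c(0.10, 0.60, 0.90),
  skinny = c(0.80, 0.40, 0.20)
)

# Two brushed intervals on convex, one on skinny.
selected <- filter_by_ranges(
  toy,
  list(convex = list(c(0, 0.2), c(0.5, 1)), skinny = list(c(0.3, 1)))
)
selected[, c("Var1", "Var2")]
```

Keeping the pair-name columns in the returned rows is what lets the printed output identify which variable pair each selected trace belongs to.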

1.6.3 Summarise the relationships between the scagnostics, in terms of positive and negative association, outliers, clustering.

Clumpy and Convex have relatively low values compared with the rest. There seem to be outliers in convex, skinny and clumpy. Sparse and Skewed show clumpiness, while the others are more spread out.

1.6.4 Pairs that have high values on convex (non-zero) tend to have what type of values on outlying, stringy, striated, skewed, skinny and splines?

Outlying: 0.0 - 0.2; Stringy: 0.6; Striated: 0.2 - 0.8; Skewed: 0.7; Skinny: 0.4; Splines: 0.5

1.6.5 Pairs of variables that have high values on skewed tend to have what type of values on outlying, stringy, striated, and splines?

Outlying: > 0.4; Stringy and Striated: > 0.8; Splines: 0

1.6.6 Identify one pair of variables that might be considered to have an unusual combination of scagnostic values, i.e. is an outlier in the scagnostics.

Clumpy and Convex